ABSTRACT

A  supervised  learning  technique,  the  Attribute  Importance  Measure  (AIM)  method,  is  proposed 
for  the  classification  of  objects  with  categorical  attributes.  The  advantage  of  this  method  over 
existing  techniques  is  its  ability  to  perform  classification  and  dimensionality  reduction,  or  feature 
selection,  with  the  same  algorithm.  The  method  uses  probabilistic  measures  alongside  logical 
concepts  of  sufficiency,  necessity  and  irrelevance  in  providing  corresponding  weights  to  values  in 
attribute value pairs. Finally, an efficient search algorithm is developed which generates decision
rules  for  classification.  The  performance  of  the  new  method  is  demonstrated  on  a  commonly  used 
machine  learning  data  set. 
A Logical and Probabilistic Technique for
Classification  and  Dimensionality  Reduction 
for  Objects  with  Categorical  Data 

Executive  Summary 

The  paper  presents  a  novel  method  for  classification,  dimensionality  reduction  and  rule 
discovery  for  data  with  categorical  attributes.  These  three  areas  of  interest  in  data  mining 
are  often  conducted  using  different  algorithms,  while  the  new  Attribute  Importance 
Measure  technique  presented  here  is  able  to  conduct  all  of  these  operations  in  the  same 
algorithm.  The  AIM  method  uses  a  probabilistic  approach,  similar  to  that  of  the  Naive 
Bayes  algorithm,  with  additional  logical  concepts  of  sufficiency  and  necessity.  The  goal  of 
this  paper  is  to  present  the  new  method  and  demonstrate  its  application,  rather  than  to 
perform  an  extensive  comparison  with  existing  techniques. 
Notation

$S$                        number of classes
$\varpi_j$                 $j$th class
                           number of attributes of each object
$X$                        an object
$x_k$                      an attribute of an object
$n_k$                      number of distinct values for the attribute $x_k$
$x_{ki}$                   a value of the $k$th attribute
$M_S(x_{ki}, \varpi_j)$    measure of sufficiency for $x_{ki}$ being in class $\varpi_j$
$M_N(x_{ki}, \varpi_j)$    measure of necessity for class $\varpi_j$, given $x_{ki}$
$M_V(x_{ki})$              measure of usefulness of attribute value $x_{ki}$ (across all $S$ classes)
$M_A(x_k)$                 measure of usefulness of attribute $x_k$ (for all attribute values in $x_k$ and across all $S$ classes)
$\mathbf{x}_g$             vector of $g$ attribute values, e.g. $\mathbf{x}_3 = \{x_{11}, x_{25}, x_{32}\}$
1. Introduction


This  paper  presents  a  new  algorithm  for  dealing  with  the  supervised  learning  problem  of 
classifying  objects  with  categorical,  or  nominal,  data.  It  is  also  concerned  with  the 
problems  of  dimensionality  reduction,  or  feature  selection,  and  rule  generation  for 
categorical  data.  The  motivation  for  this  work  is  a  large  dataset  currently  being 
investigated  in  ISRD,  most  of  whose  fields  contain  such  categorical  data. 

Analysis  of  categorical  data  restricts  the  number  of  existing  classification  algorithms  that 
may be used. Many parametric methods (James, 1985) rely on continuous data and are
inappropriate.  This  paper  addresses  pure  categorical,  or  nominal  data,  not  multinominal 
data,  which  is  simply  binned  segments  of  the  continuous  scale.  A  common  approach  for 
such  data  is  to  code  each  categorical  value  as  a  dummy  variable.  That  is,  if  there  is  a 
variable  "colour"  whose  values  are  "red",  "green"  or  "blue",  then  the  following  new 
variables might be created: "red"={0,1}, "green"={0,1} and "blue"={0,1}. This approach is
useful  because  the  new  variables  can  now  be  taken  as  quasi-continuous,  and  therefore 
used  in  traditional  machine  learning  algorithms.  However,  for  cases,  such  as  the  one 
considered  in  ISRD,  where  the  dataset  is  very  large,  and  the  number  of  categorical  values 
exceedingly  high,  such  an  approach  becomes  unworkable  as  the  dimensionality  becomes 
large. 
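The dummy-variable coding described above can be sketched as follows. This is only an illustration: the function name `dummy_code` and the attribute cardinalities are hypothetical and are not taken from the ISRD dataset.

```python
def dummy_code(value, categories):
    """Return one 0/1 indicator variable per category."""
    return {c: int(value == c) for c in categories}

colours = ["red", "green", "blue"]
coded = dummy_code("green", colours)
print(coded)

# Hypothetical numbers of distinct values per attribute, chosen only to
# show how the number of indicator columns grows with cardinality.
cardinalities = {"colour": 3, "supplier": 5000, "part_code": 120000}
n_dummy_columns = sum(cardinalities.values())
print(n_dummy_columns)
```

Each categorical attribute contributes one column per distinct value, so a few high-cardinality attributes are enough to make the recoded dataset very wide.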

There  are  algorithms  which  accommodate  categorical  data  without  the  need  for  the 
recoding  described  above,  some  of  which  are  discussed  here.  Nearest-neighbour 
classification  algorithms  (k-NN)  (Duda  et  al.,  2001)  can  perform  classification  of  a  data 
point based on the training set, but are unable to generate decision rules. Dimensionality
reduction is also possible with these algorithms, but it is conducted through an iterative
process  whereby  a  subset  of  features  is  selected,  and  the  accuracy  of  classification  using 
those  features  is  used  as  a  metric  for  comparing  different  subsets  of  features.  This  type  of 
procedure  is  more  computationally  intensive  than  single  step  solutions,  such  as  the  use  of 
the  Gini,  entropy  or  Chi-squared  measures. 

Such  measures  are  used  in  decision  tree  methods  (Hastie  et  al.,  2001)  which  are  also 
amenable  to  categorical  data.  There  are  many  decision  tree  algorithms  in  the  literature,  for 
example  Quinlan  (1986),  Lim  (2000)  and  Frank  (1998),  and  they  are  easily  applicable  to  the 
dataset  being  considered  in  ISRD.  However,  there  is  an  amount  of  inflexibility  in  rule 
generation  using  these  methods.  When  descending  a  decision  tree,  from  the  root  to  a  leaf, 
each  node  is  positioned  based  on  some  metric,  such  as  the  Gini  measure,  as  given  above. 
This  provides  the  implicit  feature  selection  of  the  algorithm.  Decision  rules  are  then 
formed  by  amalgamating  decisions  at  each  branch  through  to  a  leaf  node.  However, 
variables  used  in  splits  close  to  the  root  are  based  on  the  usefulness  of  the  variable  over  all 
classes,  not  each  individual  one.  It  is  quite  probable  that  the  decision  rules  for  each  class 
may  not  rely  on  the  same  variables,  and  therefore  the  decision  rules  generated  through 
decision  trees  are  somewhat  limited. 
Probabilistic approaches are also suited to classification with categorical data. The classical
probabilistic  approach  uses  the  Naive  Bayes  method  (Lewis,  1998  and  Mosteller  & 
Wallace,  1984),  and  it  is  this  method  which  is  the  most  similar  to  the  new  algorithm 
presented  in  this  paper.  Therefore,  classification  using  the  Naive  Bayes  approach  is 
described  further  below. 


Consider the problem of classifying an object, $X$, into one of $S$ mutually exclusive,
exhaustive classes, $(\varpi_1, \ldots, \varpi_j, \ldots, \varpi_S)$. There is a series of attributes of $X$, $(x_1, \ldots, x_k, \ldots)$,
whose values are all categorical. The traditional approach would be to classify $X$ into the
class where the conditional probability $P(\varpi_j \mid X)$ is maximised (Hand, 1981, p.5).

Estimates  of  probabilities  are  calculated  from  the  training  set.  In  many  applications,  for 
example  textual  analysis  (Sebastiani,  2002),  the  vector  of  attributes  associated  with  X  can 
be  large,  causing  this  conditional  probability  to  be  non-trivial  to  calculate.  Therefore, 
Bayes'  Theorem  (Lewis,  1998)  is  used  to  put  this  probability  in  a  more  convenient  form. 


$$P(\varpi_j \mid X) = \frac{P(X \mid \varpi_j)\,P(\varpi_j)}{P(X)} \qquad (1)$$


Maximising $P(\varpi_j \mid X)$ is effective for finding the class to which $X$ most likely belongs.
Exclusion from a class can also be deduced as $P(\varpi_j \mid X)$ approaches zero. The point of
changeover between likely inclusion and likely exclusion of an object from a class is rarely
mentioned in the literature for the Bayesian approach. James (1985, pp. 70-72) proposes a limit
of $(1 - t)$ to $P(\varpi_j \mid X)$, below which an object is excluded from classification in $\varpi_j$. The
parameter $t$ here is arbitrarily chosen by the user, rather than being a rigorous criterion.
The  method  presented  in  this  paper  makes  use  of  a  more  rigorously  defined  changeover 
point,  and  it  will  be  shown  that  it  has  a  logical  significance  which,  when  combined  with 
other  logical  conditions,  leads  to  further  functionality  of  the  classification  algorithm. 
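As a small worked illustration of Equation 1, the sketch below estimates the posterior for a single categorical attribute from training counts. The class labels, attribute values and counts are hypothetical and serve only to show the arithmetic.

```python
from fractions import Fraction

# Hypothetical training counts for one categorical attribute:
# counts[class_label][attribute_value] = number of training objects.
counts = {
    "A": {"red": 8, "green": 2},
    "B": {"red": 3, "green": 7},
}

def posterior(counts, value):
    """P(class | value) via Equation 1: P(value | class) P(class) / P(value)."""
    n = sum(sum(row.values()) for row in counts.values())
    p_value = Fraction(sum(row[value] for row in counts.values()), n)
    result = {}
    for label, row in counts.items():
        n_class = sum(row.values())
        likelihood = Fraction(row[value], n_class)  # P(value | class)
        prior = Fraction(n_class, n)                # P(class)
        result[label] = likelihood * prior / p_value
    return result

post = posterior(counts, "red")
best = max(post, key=post.get)  # class with the maximum posterior
print(post, best)
```

The posteriors sum to one over the classes, and the traditional rule simply picks the largest of them.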


2.  Attribute  value  distributions  and  logical  conditions 

For brevity and clarity, let us consider just one of the attributes of $X$, $x_k$ say, which has
categorical attribute values $\{x_{k1}, \ldots, x_{ki}, \ldots, x_{kn_k}\}$, where $n_k$ is the number of distinct
categorical data values for attribute $x_k$. Figure 1 gives an example contingency table
showing  the  distribution  of  these  attribute  values  amongst  three  classes  for  100  training 
objects. 
              $x_{k1}$   $x_{k2}$   $x_{k3}$
$\varpi_1$       15          0          0
$\varpi_2$        5         30          0
$\varpi_3$       20          7         23

Figure 1 Example data for $n_k = 3$ and $S = 3$.

Firstly, we can see from the figure that all of the occurrences of $x_{k3}$ fall in class $\varpi_3$.
Therefore, if an object has the attribute value $x_{k3}$, that is sufficient information to classify
the object into class $\varpi_3$. Equally, given that no objects with attribute value $x_{k3}$ fall into $\varpi_1$
or $\varpi_2$, it is sufficient to know that an object has attribute value $x_{k3}$ to know that it does not
fall into either of classes $\varpi_1$ or $\varpi_2$. This example shows that these conditions of perfect
sufficiency occur when a column in the contingency table of Figure 1 is non-zero in just
one cell. This can be summarised as follows.

Sufficient for inclusion in $\varpi_j$ if $P(\varpi_j \mid x_{ki}) = 1$  (2)

Sufficient for exclusion from $\varpi_j$ if $P(\varpi_j \mid x_{ki}) = 0$  (3)

Consider now the attribute values which occur for class $\varpi_1$. Figure 1 shows that only $x_{k1}$
falls in $\varpi_1$, but that objects with attribute value $x_{k1}$ can also be found in other classes.
Therefore, it is necessary for an object to have attribute value $x_{k1}$ to be in class $\varpi_1$, but not
sufficient. The negative necessity condition is exemplified by $x_{k3}$ and $\varpi_1$ or $\varpi_2$, where it
is necessary for an object not to have $x_{k3}$ for it to be in either class. These cases show that
perfect conditions of necessity result from rows in the contingency table of Figure 1
containing only one non-zero element. More specifically:

Necessary for inclusion in $\varpi_j$ if $P(x_{ki} \mid \varpi_j) = 1$  (4)

Necessary for exclusion from $\varpi_j$ if $P(x_{ki} \mid \varpi_j) = 0$  (5)

Note that Equations 3 and 5 are logically equivalent. This implies that a condition of
negative sufficiency is the same as a condition of negative necessity. That is, if it is
necessary that an object not have attribute value $x_{ki}$ to be in class $\varpi_j$, then this is sufficient
information to know that an object that has $x_{ki}$ is not in $\varpi_j$.


Note  that  the  logical  conditions  of  sufficiency  and  necessity  are  not  mutually  exclusive. 
The  combined  condition  occurs  when  a  cell  contains  the  only  non-zero  value  in  its  row 
and  column.  In  this  case  there  is  direct  mapping  between  the  attribute  value  and  the  class. 
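The row and column tests for the perfect conditions can be sketched on the Figure 1 counts as follows. The function names are illustrative only; rows are taken to be classes and columns to be attribute values, as in the figure.

```python
# Figure 1 counts: rows are classes 1..3, columns are attribute values 1..3.
counts = [
    [15, 0, 0],
    [5, 30, 0],
    [20, 7, 23],
]

def perfect_sufficiency(counts):
    """(value, class) pairs whose column has exactly one non-zero cell."""
    pairs = []
    for i in range(len(counts[0])):
        nonzero = [j for j in range(len(counts)) if counts[j][i] > 0]
        if len(nonzero) == 1:
            pairs.append((i + 1, nonzero[0] + 1))
    return pairs

def perfect_necessity(counts):
    """(value, class) pairs whose row has exactly one non-zero cell."""
    pairs = []
    for j, row in enumerate(counts):
        nonzero = [i for i, c in enumerate(row) if c > 0]
        if len(nonzero) == 1:
            pairs.append((nonzero[0] + 1, j + 1))
    return pairs

sufficient = perfect_sufficiency(counts)  # third value is sufficient for class 3
necessary = perfect_necessity(counts)     # first value is necessary for class 1
print(sufficient, necessary)
```

A pair that appears in both lists would be the combined condition: the only non-zero cell in both its row and its column.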
Table 1 Logical conditions and their probabilistic criteria

Logical condition          Direction   Criteria                                              $M_S(x_{ki}, \varpi_j)$   $M_N(x_{ki}, \varpi_j)$
Sufficiency                Inclusion   $P(x_{ki} \mid \varpi_j) = P(x_{ki}) / P(\varpi_j)$   1                         N/A
Sufficiency or Necessity   Exclusion   $P(x_{ki} \mid \varpi_j) = 0$                         -1                        -1
Necessity                  Inclusion   $P(x_{ki} \mid \varpi_j) = 1$                         N/A                       1
Irrelevance                N/A         $P(x_{ki} \mid \varpi_j) = P(x_{ki})$                 0                         0


The final logical condition introduced here is that of irrelevance. If $P(\varpi_j \mid x_{ki}) = P(\varpi_j)$,
then the condition that an object has attribute value $x_{ki}$ does not affect the classification of
the object into class $\varpi_j$. This is the definition that the occurrence of $x_{ki}$ and the
classification of an object into $\varpi_j$ are independent events. Gennari et al. (1989, section 5.5)
have previously expressed irrelevance in these terms. In Figure 1,
$P(\varpi_3 \mid x_{k1}) = P(\varpi_3) = 0.5$, signifying that the presence of $x_{k1}$ is irrelevant to classification
into $\varpi_3$. Note, however, that this conditional probability is the maximum for $x_{k1}$ and any
of the classes, and would therefore have been the optimal classification mechanism under
a traditional Naive Bayes scheme. This is similar to the Bayesian constant rule problem
(James, 1985, pp. 159-160) where every case is assigned to the class which occurs most
often. James (1985) notes that this is a particular problem for categorical data. By defining
a point of irrelevance the new method avoids this problem.
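The irrelevance example can be checked directly from the Figure 1 counts. The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Figure 1 counts: rows are classes 1..3, columns are attribute values 1..3.
counts = [[15, 0, 0], [5, 30, 0], [20, 7, 23]]
n = sum(map(sum, counts))

priors = [Fraction(sum(row), n) for row in counts]            # P(class)
first_value_total = sum(row[0] for row in counts)
posteriors = [Fraction(row[0], first_value_total) for row in counts]  # P(class | first value)

# Index of the class a maximum-posterior rule would choose.
best = max(range(len(counts)), key=lambda j: posteriors[j])
print(priors, posteriors, best)
```

The maximum posterior selects the third class, yet that posterior equals the class prior, so knowing the attribute value has added no information about that class.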

Bayes' Theorem is used to express the criteria above in terms of $P(x_{ki} \mid \varpi_j)$. The resulting
expressions are presented in Table 1. The rightmost two columns of the table will be
discussed in the next section.


3.  Definition  of  Attribute  Importance  Measures 

3.1  Sufficiency  and  necessity  measures 

It is proposed that two measures be introduced: $M_S(x_{ki}, \varpi_j)$, which gives a measure of
the sufficiency that an object be in class $\varpi_j$ given attribute value $x_{ki}$; and $M_N(x_{ki}, \varpi_j)$,
which gives a measure of the necessity that an object have attribute value $x_{ki}$ to be in class
$\varpi_j$. The measures attain their extreme values of ±1 when the perfect logical conditions of Table
1 are achieved, and are zero for the condition of irrelevance (as presented in Table 1). For
simplicity, linear interpolation is used between these points, as shown in Figure 2,
resulting in Equations 6 to 8 below. Note that this derivation is in terms of a single
attribute value: the method will be applied to vectors of attribute values later in the paper.


Figure 2 $M_S(x_{ki}, \varpi_j)$ and $M_N(x_{ki}, \varpi_j)$ versus $P(x_{ki} \mid \varpi_j)$


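$$M_S(x_{ki}, \varpi_j) = \frac{P(x_{ki} \mid \varpi_j) - P(x_{ki})}{\dfrac{P(x_{ki})}{P(\varpi_j)} - P(x_{ki})} \quad \text{if } P(x_{ki} \mid \varpi_j) > P(x_{ki}) \qquad (6)$$

$$M_N(x_{ki}, \varpi_j) = \frac{P(x_{ki} \mid \varpi_j) - P(x_{ki})}{1 - P(x_{ki})} \quad \text{if } P(x_{ki} \mid \varpi_j) > P(x_{ki}) \qquad (7)$$

$$M_S(x_{ki}, \varpi_j) = M_N(x_{ki}, \varpi_j) = \frac{P(x_{ki} \mid \varpi_j) - P(x_{ki})}{P(x_{ki})} \quad \text{if } P(x_{ki} \mid \varpi_j) < P(x_{ki}) \qquad (8)$$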


It can easily be shown that a perfect necessity condition is unachievable if $P(x_{ki}) < P(\varpi_j)$,
while no sufficiency condition can be obtained if $P(x_{ki}) > P(\varpi_j)$.


Having  calculated  these  measures,  an  object  may  be  classified  into  a  class  where  the 
sufficiency  measure  is  highest  (and  positive).  This  is  analogous  to  Bayesian  classification 
where  the  conditional  probability  of  Equation  1  is  at  its  maximum. 
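A minimal sketch of Equations 6 to 8 applied to the Figure 1 counts is given below. The function names are illustrative, the case of equality between $P(x_{ki} \mid \varpi_j)$ and $P(x_{ki})$ is assumed to fall under Equation 8 (where it gives zero), and the probabilities are assumed to lie strictly between 0 and 1 so that no denominator vanishes.

```python
from fractions import Fraction

def aim_measures(p_cond, p_x, p_class):
    """Return (M_S, M_N) given P(x|class), P(x) and P(class)."""
    diff = p_cond - p_x
    if diff > 0:
        # Equations 6 and 7: linear interpolation above the irrelevance point.
        return diff / (p_x / p_class - p_x), diff / (1 - p_x)
    # Equation 8: both measures coincide at or below the irrelevance point.
    return diff / p_x, diff / p_x

# Figure 1 counts: rows are classes 1..3, columns are attribute values 1..3.
counts = [[15, 0, 0], [5, 30, 0], [20, 7, 23]]
n = sum(map(sum, counts))
p_class = [Fraction(sum(row), n) for row in counts]
p_x = [Fraction(sum(row[i] for row in counts), n) for i in range(3)]

measures = {}
for j, row in enumerate(counts):
    for i, c in enumerate(row):
        measures[i, j] = aim_measures(Fraction(c, sum(row)), p_x[i], p_class[j])

def classify(i):
    """Class index with the highest positive M_S for value index i, else None."""
    j = max(range(len(counts)), key=lambda j: measures[i, j][0])
    return j if measures[i, j][0] > 0 else None

for (i, j), (m_s, m_n) in sorted(measures.items()):
    print(f"value {i + 1}, class {j + 1}: M_S={float(m_s):+.3f} M_N={float(m_n):+.3f}")
```

On these counts the first attribute value is assigned to the first class, for which its sufficiency measure is positive, whereas the maximum-posterior rule would choose the third class, for which the value is irrelevant.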
3.2 The Attribute Value Measure

Equations  6  to  8  give  measures  of  the  usefulness  of  attribute  values  in  classifying  an  object 
into  a  particular  class.  A  simple  summation  scheme  can  now  reveal  the  classifying  power 
of an attribute value $x_{ki}$ over all classes. A summation is logical as the zeros for the AIM
measures, $M_S$ and $M_N$, occur when the attribute value is irrelevant to classification. This
contrasts with the traditional approach, where $P(x_{ki} \mid \varpi_j) = 0$ in fact relates to a case of perfect
prediction that an object does not belong to class $\varpi_j$. The measure of the predictive power of an
attribute value, $M_V(x_{ki})$, in general terms is given by

$$M_V(x_{ki}) = \sum_{j=1}^{S} F\big(M_S(x_{ki}, \varpi_j), M_N(x_{ki}, \varpi_j)\big)\, W_j \qquad (9)$$

where $F$ is some function of the measures of sufficiency and necessity, and $W_j$ is a weight
parameter. An obvious choice for $W_j$ is $P(\varpi_j)$. A convenient form of the function $F$ is a
linear combination of the squares of the measures, which results in $M_V(x_{ki})$ being zero
when all the individual measures are zero (that is for total irrelevance), and achieving a
maximum of one when the individual measures are all at their maxima (either positive or
negative). Such a function is given in Equation 9a.

$$M_V(x_{ki}) = \sum_{j=1}^{S} \tfrac{1}{2}\big(M_S(x_{ki}, \varpi_j)^2 + M_N(x_{ki}, \varpi_j)^2\big)\, P(\varpi_j) \qquad (9a)$$

A  measure  of  the  usefulness  of  an  attribute  value  to  classification  is  dependent  on  how  that 
attribute  value  is  to  be  used  in  classification.  The  measure  presented  in  Equation  9a  gives 
equal  weighting  to  the  measures  of  sufficiency  and  necessity.  In  many  applications, 
however,  it  is  only  the  sufficiency  measure  that  will  be  of  use,  as  it  is  sufficiency  which 
maps  a  given  attribute  value  directly  to  a  class.  Therefore,  a  second  variation  for  the 
function  F  of  Equation  9  is  presented  below  for  use  in  applications  where  the  condition  of 
necessity is not used in classification. Note that the modified measure maintains its range
of [0,1].

$$M_V(x_{ki}) = \sum_{j=1}^{S} M_S(x_{ki}, \varpi_j)^2\, P(\varpi_j) \qquad (9b)$$
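The two forms of the attribute value measure can be sketched on the Figure 1 counts as follows. The helper names are illustrative, and Equation 9a is assumed here to average the two squared measures (a factor of one half), which is what gives the stated maximum of one.

```python
from fractions import Fraction

def aim_measures(p_cond, p_x, p_class):
    """(M_S, M_N) from Equations 6 to 8; equality is treated as Equation 8."""
    diff = p_cond - p_x
    if diff > 0:
        return diff / (p_x / p_class - p_x), diff / (1 - p_x)
    return diff / p_x, diff / p_x

def value_measures(counts, i):
    """Return (M_V by Equation 9a, M_V by Equation 9b) for value index i."""
    n = sum(map(sum, counts))
    p_x = Fraction(sum(row[i] for row in counts), n)
    mv_9a = mv_9b = Fraction(0)
    for row in counts:
        p_class = Fraction(sum(row), n)
        m_s, m_n = aim_measures(Fraction(row[i], sum(row)), p_x, p_class)
        mv_9a += Fraction(1, 2) * (m_s**2 + m_n**2) * p_class  # assumed 1/2 factor
        mv_9b += m_s**2 * p_class
    return mv_9a, mv_9b

# Figure 1 counts: rows are classes 1..3, columns are attribute values 1..3.
counts = [[15, 0, 0], [5, 30, 0], [20, 7, 23]]
results = [value_measures(counts, i) for i in range(3)]
for i, (mv_9a, mv_9b) in enumerate(results):
    print(f"value {i + 1}: 9a={float(mv_9a):.3f} 9b={float(mv_9b):.3f}")
```

The third attribute value, which is perfectly sufficient for one class and perfectly excludes the others, reaches the maximum of one under Equation 9b but not under Equation 9a, because it is not also necessary for its class.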

3.3  The  Attribute  Measure 

Just as $M_V(x_{ki})$ provides a measure of the predictive power of attribute value $x_{ki}$, a
summation scheme over all of an attribute's values can provide a measure of the
classification power contained in the attribute as a whole, $x_k$. For example, if the
individual values of an attribute all have $M_V(x_{ki}) = 0$, then none of them provides useful
information for classification and therefore the attribute $x_k$ as a whole has no predictive
power. The measure for an attribute as a whole, $M_A(x_k)$, is given in general terms by

$$M_A(x_k) = \sum_{i=1}^{n_k} M_V(x_{ki})\, W_i \qquad (10)$$

where $W_i$ is a weight parameter. A natural choice for this weight is $P(x_{ki})$, giving

$$M_A(x_k) = \sum_{i=1}^{n_k} M_V(x_{ki})\, P(x_{ki}) \qquad (10a)$$

The bounds for $M_A(x_k)$ from Equation 10a are [0,1]. The minimum and maximum of
$M_A(x_k)$ are achieved when the sufficiency and necessity measures for each attribute value
in each class are $M_S(x_{ki}, \varpi_j) = M_N(x_{ki}, \varpi_j) = 0$ and $M_S(x_{ki}, \varpi_j) = M_N(x_{ki}, \varpi_j) = \pm 1$
respectively. Note that if the attribute value measures, $M_V(x_{ki})$, are calculated omitting
the necessity condition, as in Equation 9b, then the corresponding attribute measure,
$M_A(x_k)$, will also be a measure based only on sufficiency results. In this case only the
sufficiency measures need be maximal for the attribute measure to achieve a value of one.

Dimensionality reduction can be undertaken based on a ranking of attributes according to
the attribute measure $M_A(x_k)$.
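Ranking attributes by the attribute measure can be sketched as below, using the sufficiency-only form (Equation 9b inside Equation 10a). The Figure 1 counts are from the text; the other two contingency tables and all names are hypothetical, constructed to show the two ends of the range.

```python
from fractions import Fraction

def sufficiency(p_cond, p_x, p_class):
    """M_S from Equations 6 and 8; equality is treated as Equation 8."""
    diff = p_cond - p_x
    denom = (p_x / p_class - p_x) if diff > 0 else p_x
    return diff / denom

def attribute_measure(counts):
    """M_A from Equation 10a with M_V from Equation 9b (rows are classes)."""
    n = sum(map(sum, counts))
    m_a = Fraction(0)
    for i in range(len(counts[0])):
        p_x = Fraction(sum(row[i] for row in counts), n)
        m_v = Fraction(0)
        for row in counts:
            p_class = Fraction(sum(row), n)
            m_s = sufficiency(Fraction(row[i], sum(row)), p_x, p_class)
            m_v += m_s**2 * p_class
        m_a += m_v * p_x
    return m_a

attributes = {
    "figure_1": [[15, 0, 0], [5, 30, 0], [20, 7, 23]],
    # Hypothetical: value proportions identical in every class.
    "hypothetical_irrelevant": [[3, 12], [7, 28], [10, 40]],
    # Hypothetical: one-to-one mapping between values and classes.
    "hypothetical_perfect": [[15, 0, 0], [0, 35, 0], [0, 0, 50]],
}
scores = {name: attribute_measure(table) for name, table in attributes.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, {name: float(s) for name, s in scores.items()})
```

Attributes at the bottom of such a ranking are the natural candidates for removal; where to cut the ranking is left to the user.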


4. The Attribute Measure and the test for independence

In the preceding section, the AIM measure for the attribute as a whole, $M_A$, was
interpreted as being a measure of how useful the attribute is to classification. The ability to
perfectly classify, based on just one attribute, is achieved when $M_A = 1$. When $M_A \approx 0$,
each attribute value is assumed to be irrelevant to classification. This state corresponds to
independence between the attribute values and the classes, and therefore may be
compared with traditional tests for independence, such as the chi squared ($\chi^2$) test
(Anderson et al., 1994).

It  can  be  shown  that  a  chi  squared  statistic  for  analysing  a  contingency  table  such  as  Figure 
1  is 
$$\frac{\chi^2}{N} = \sum_{i=1}^{n_k} \sum_{j=1}^{S} \left(\frac{P(x_{ki} \mid \varpi_j) - P(x_{ki})}{P(x_{ki})}\right)^2 P(\varpi_j)\, P(x_{ki}) \qquad (11)$$


where  N  is  the  number  of  observations  and  the  probabilities  are  estimated  from  the 
contingency  table. 


Now consider $M_A$ using the form of $M_V$ from Equation 9b. Substituting Equations 6, 8
and 9b into Equation 10a gives

$$M_A(x_k) = \sum_{i=1}^{n_k} \sum_{j=1}^{S} \left(\frac{P(x_{ki} \mid \varpi_j) - P(x_{ki})}{D}\right)^2 P(\varpi_j)\, P(x_{ki}) \qquad (12)$$

where $D$ is the denominator of Equations 6 and 8. The similarities between Equations 11
and 12 are obvious, and indeed in the cases where $P(x_{ki} \mid \varpi_j) < P(x_{ki})$ the terms within
the summations are identical.
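The relationship between Equations 11 and 12 can be checked term by term on the Figure 1 counts. The sketch below also compares Equation 11 with the usual observed-versus-expected form of the chi squared statistic; variable names are illustrative.

```python
from fractions import Fraction

# Figure 1 counts: rows are classes 1..3, columns are attribute values 1..3.
counts = [[15, 0, 0], [5, 30, 0], [20, 7, 23]]
n = sum(map(sum, counts))
p_class = [Fraction(sum(row), n) for row in counts]
p_x = [Fraction(sum(row[i] for row in counts), n) for i in range(3)]

terms = []          # (Equation 11 term, Equation 12 term, difference from irrelevance)
pearson = Fraction(0)
for j, row in enumerate(counts):
    for i, c in enumerate(row):
        diff = Fraction(c, sum(row)) - p_x[i]
        weight = p_class[j] * p_x[i]
        chi_term = (diff / p_x[i]) ** 2 * weight
        # D is the denominator of Equation 6 above irrelevance, of Equation 8 otherwise.
        d = (p_x[i] / p_class[j] - p_x[i]) if diff > 0 else p_x[i]
        aim_term = (diff / d) ** 2 * weight
        terms.append((chi_term, aim_term, diff))
        expected = n * weight
        pearson += (c - expected) ** 2 / expected   # (observed - expected)^2 / expected

chi2_over_n = sum(t[0] for t in terms)
m_a = sum(t[1] for t in terms)
print(float(chi2_over_n), float(m_a))
```

The two sums differ only through the cells where the conditional probability exceeds the marginal, since that is where the denominators differ.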

It  is  known  that  the  asymptotic  distribution  of  the  chi  squared  statistic  is  independent  of 
the  distribution  of  the  classes  and  attribute  values.  Further,  it  is  dependent  only  on  the 
degrees  of  freedom  of  the  system  under  investigation,  which  in  this  case  is  the  sum  of  the 
number  of  rows  and  columns  in  the  contingency  table  less  two.  A  simulation  study  was 
undertaken to demonstrate that these properties also apply to the $M_A$ measure.


4.1 Simulation study of $M_A$ under independence

In  the  simulation  study,  a  contingency  table  was  filled  according  to  predefined  probability 
distribution  functions.  These  functions,  nominally  labelled  "uniform",  "quadratic"  and 
"peak",  are  shown  in  Figure  3  for  a  10  by  10  table,  and  were  applied  independently  to  the 
distributions  of  the  classes  and  attribute  values.  That  is,  the  attribute  values  and  classes 
were  independent  of  each  other.  The  N  objects  in  the  contingency  table  were  therefore 
independently and identically distributed (i.i.d.). With the table filled, the $M_A$ measure was
calculated (using Equation 12) and recorded. This procedure was repeated 10,000 times to
give a distribution of the $M_A$ measure. This whole process was in turn repeated a number
of  times,  varying  the  probability  distribution  functions  of  the  classes  and  attribute  values 
each  time. 
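A scaled-down sketch of this kind of simulation is given below. It is not the study reported here: the table size, the number of objects and repetitions, the random seed and the shape of the "peak" distribution are all assumptions made to keep the example small, and the function names are illustrative.

```python
import random

def m_a_times_n(counts):
    """N * M_A from Equation 12 for a table whose rows are classes."""
    n = sum(map(sum, counts))
    col = [sum(row[i] for row in counts) for i in range(len(counts[0]))]
    total = 0.0
    for row in counts:
        n_class = sum(row)
        if n_class == 0:
            continue                      # class never drawn in this table
        p_class = n_class / n
        for i, c in enumerate(row):
            if col[i] == 0:
                continue                  # value never drawn in this table
            p_x = col[i] / n
            diff = c / n_class - p_x
            d = (p_x / p_class - p_x) if diff > 0 else p_x
            total += (diff / d) ** 2 * p_class * p_x
    return n * total

def simulate(value_pdf, class_pdf, n_objects, repeats, rng):
    """Fill tables with independently drawn values and classes; return N * M_A per table."""
    results = []
    for _ in range(repeats):
        counts = [[0] * len(value_pdf) for _ in class_pdf]
        values = rng.choices(range(len(value_pdf)), weights=value_pdf, k=n_objects)
        classes = rng.choices(range(len(class_pdf)), weights=class_pdf, k=n_objects)
        for i, j in zip(values, classes):
            counts[j][i] += 1
        results.append(m_a_times_n(counts))
    return results

rng = random.Random(0)
uniform = [1, 1, 1, 1, 1]
peak = [1, 2, 6, 2, 1]   # assumed shape, not the distribution used in the report
a = simulate(uniform, uniform, 1000, 100, rng)
b = simulate(peak, peak, 1000, 100, rng)
print(sum(a) / len(a), sum(b) / len(b))
```

The two samples could then be compared with a rank-based test, as is done in the study; that step is omitted here to keep the sketch within the standard library.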

Table 2 gives the details of each case considered in the simulation study for a 10 by 10
contingency table (18 degrees of freedom). The results are plotted in Figure 4 in terms of
$M_A N$ to be dimensionally consistent with chi squared, where $N$ is the number of objects
classified. The similarity of the histograms in the figure suggests the populations for each
case are identical. Using the Mann-Whitney-Wilcoxon test (Anderson et al., 1994, pp. 729-734),
the hypothesis that the populations are identical was retained at the 5%
significance level in all cases. It can therefore be said that the distribution of $M_A N$ is independent of the
distribution of attribute values and classes in the case where these are independent of each
other. This is also a property of the chi squared distribution.

DSTO-RR-0276



Table 2 Details of the cases used in the simulations

Case   N        Attribute value pdf   Class pdf
1      10,000   Uniform               Uniform
2      10,000   Peak                  Peak
3      10,000   Quadratic             Quadratic
4      10,000   Quadratic             Uniform
5      5,000    Quadratic             Quadratic

Figure 3 Probability density functions used in stochastic analysis (horizontal axis: contingency table row/column number, 1 to 10)
Figure 4 Histogram of $M_A N$ for the case of independence

Further  study  used  differently  sized  contingency  tables,  which  therefore  had  different 
degrees of freedom. It was found that the distribution of $M_A N$ varied with the degrees of
freedom.  This  dependency  of  the  distribution  only  on  the  degrees  of  freedom  is  also  a 
property  of  chi  squared,  indicating  the  similarity  between  the  two  statistics. 


5.  Application  to  vectors  and  expressions 

The derivation of the measures of necessity and sufficiency in Section 3.1 was expressed in
terms of a single attribute value, $x_{ki}$. The measures for vectors of attribute values, such as
$M_S((x_{19}, x_{21}, x_{35}), \varpi_j)$ and $M_N((x_{19}, x_{21}, x_{35}), \varpi_j)$, or expressions, such as
$M_S((x_{19} \cup x_{21}), \varpi_j)$, are now considered. They rely on substitution of $P(x_{ki})$ and
$P(x_{ki} \mid \varpi_j)$ in Equations 6 to 8 with corresponding expressions for the combined attribute
value vector or expression. In the case of a vector of attribute values,
$\mathbf{x}_g = (x_{1i}, \ldots, x_{ki}, \ldots, x_{gi})$, the substituted expressions are $P(\mathbf{x}_g)$ and $P(\mathbf{x}_g \mid \varpi_j)$. The
assumptions of dependence or independence made when calculating these values have
repercussions for the AIM method, and are therefore addressed below. Note that, for
consistency, $P(\mathbf{x}_g)$ and $P(\mathbf{x}_g \mid \varpi_j)$ should both be calculated under the same assumption,
whether this be independence or dependence.

5.1  Independent  attributes 

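If the occurrence of values between attributes is assumed to be independent, then the
following is true.

$$P(\mathbf{x}_g) = P((x_{1i}, \ldots, x_{ki}, \ldots, x_{gi})) = \prod_{k=1}^{g} P(x_{ki}) \qquad (13)$$

Further, the conditional probability for a vector of attribute values and a class is commonly
decomposed thus (Lewis, 1998):

$$P(\mathbf{x}_g \mid \varpi_j) = P((x_{1i}, \ldots, x_{ki}, \ldots, x_{gi}) \mid \varpi_j) = \prod_{k=1}^{g} P(x_{ki} \mid \varpi_j) \qquad (14)$$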

The  assumption  of  independence  is  useful  where  there  are  insufficient  samples  to 
adequately  calculate  the  probabilities  for  vectors  directly,  but  where  the  ability  to  compose 
the  probabilities,  as  in  Equations  13  and  14,  leads  to  acceptable  approximations.  Note  that 
in  cases  where  independence  is  assumed,  but  is  not  reasonable,  the  sufficiency  and 
necessity  measures  may  exceed  the  theoretical  bounds  of  [-1,1]. 
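The effect of an unreasonable independence assumption can be sketched with a deliberately extreme case. The training set below is hypothetical: its two attributes are exact copies of each other, so the product forms of Equations 13 and 14 are poor approximations. The function name is illustrative.

```python
from fractions import Fraction
from math import prod

# Hypothetical training set of (first attribute, second attribute, class);
# the two attributes are perfectly dependent on each other.
training = [("a1", "b1", "c1")] * 5 + [("a2", "b2", "c2")] * 5

def p_vector(training, vector, label=None):
    """Product of per-attribute probabilities: Equation 13, or Equation 14 if a class is given."""
    rows = [r for r in training if label is None or r[-1] == label]
    return prod(
        Fraction(sum(1 for r in rows if r[k] == v), len(rows))
        for k, v in enumerate(vector)
    )

vector = ("a1", "b1")
p_xg = p_vector(training, vector)                # 1/2 * 1/2 under independence
p_cond = p_vector(training, vector, label="c1")  # 1 * 1
p_class = Fraction(5, 10)

# Equation 6 with the vector probabilities substituted for the single-value ones.
m_s = (p_cond - p_xg) / (p_xg / p_class - p_xg)
print(p_xg, p_cond, m_s)
```

Here the product form understates the probability of the vector, which is in fact one half, and the resulting sufficiency measure lies well outside [-1,1].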

5.2  Dependent  attributes 

The  assumption  of  dependent  attribute  values  is  a  more  rigorous  approach  and  leads  to 
more  accurate  sufficiency  and  necessity  measures,  but  may  be  more  difficult  to  calculate. 
In particular, where the vector is long, the particular combination of attribute
values  being  considered  may  not  have  occurred  in  the  training  set,  thus  inhibiting  accurate 
estimation  of  the  probability. 


6.  A  decision  rule  search  algorithm 

A vector of attribute values with a high sufficiency measure for a class can be thought of as
a decision rule. For example, if $M_S((x_{11}, x_{21}, x_{35}, x_{42}), \varpi_1) = 1$, then the rule
$(x_{11}, x_{21}, x_{35}, x_{42}) \rightarrow \varpi_1$ can be deduced. Note that the size of the vector need not be
constant between decision rules: one class may be defined by a single attribute value,
while another may need information from all of the available attributes before a decision
rule may be formulated. Since there are potentially many attributes, each with many
values, the total number of possible vectors of attribute values for decision rules can be
large. Therefore, an efficient method by which these decision rules may be identified is of
interest in order to obviate the need to investigate each possible attribute value vector.

Consider a vector of dependent attribute values which has a high sufficiency measure for a
given class, say $M_S((x_{11}, x_{27}, x_{35}, x_{42}), \varpi_1) = 1$. In this case any object with the attribute
value vector given can be confidently classified into set $\varpi_1$. If the attribute values are
analysed individually, it will be found that none of them has a high sufficiency measure
for class $\varpi_1$. However, class $\varpi_1$ only occurs when each of these attribute values is present.
That is, it is necessary for an object to have each of $x_{11}$, $x_{27}$, $x_{35}$ and $x_{42}$ to be in $\varpi_1$.
Therefore, each attribute value will individually have a high necessity measure for $\varpi_1$.
This demonstrates that amalgamating dependent attribute values with high necessity
measures leads to vectors of attribute values with a higher sufficiency measure than those
of the individual attribute values. Implicit in this statement is that attribute values with
negative necessity measures for a class will not lead to a higher sufficiency measure when
amalgamated with other attribute values. These principles are now used to develop an
algorithm which searches for the combinations of attribute values which are best at
classifying objects, that is, decision rules.

The  algorithm  presented  here  iterates  through  the  attribute  values,  amalgamating  those 
with  high  necessity  measures  until  a  high  sufficiency  measure  is  obtained  for  each  class. 
Those  combinations  which  have  a  negative  necessity  measure  are  eliminated  at  each  stage, 
thus  reducing  the  number  of  vectors  that  need  to  be  assessed.  The  algorithm  is  set  out 
below. 


DEFINE the lower limit for the sufficiency measure, $M'_S$, above which objects are
accepted for classification. This will be dependent on the acceptable misclassification
rate in particular applications, and is determined by the user.

DO WHILE $j \le S$ (cycle through the classes)

    DEFINE vectors of attribute values, $\mathbf{x}_g$. The initial size of each vector, $|\mathbf{x}_g|$, is 1,
    that is, each vector corresponds to a single attribute value. The subscript $g$
    corresponds to each unique vector of the attribute values.

    DO WHILE there are unclassified vectors

        CALCULATE $M_S(\mathbf{x}_g, \varpi_j)$ and $M_N(\mathbf{x}_g, \varpi_j)$ for each vector

        IF $M_S(\mathbf{x}_g, \varpi_j) > M'_S$ THEN
            CLASSIFY vector into set $\varpi_j$
            REMOVE vector from further calculations for $\varpi_j$
        END IF

        REMOVE vectors where $M_N(\mathbf{x}_g, \varpi_j) < 0$ from further calculations for $\varpi_j$

        FORM new vector combinations, using the attribute values in the remaining
        vectors, to increment the vector length, $|\mathbf{x}_g|$, by 1. The new vectors cannot
        contain subsets which are vectors that have been eliminated in previous
        iterations.

    LOOP

    FOR each $\mathbf{x}_g$ in the set of vectors for class $\varpi_j$
        IF $M_S(\mathbf{x}_g, \varpi_j) = M_N(\mathbf{x}_g, \varpi_j) = 1$ THEN
            REMOVE all other vectors for $\varpi_j$, as they are redundant due to $\mathbf{x}_g$
            being both necessary and sufficient.
        END IF
    LOOP

LOOP
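One possible reading of the search for a single class is sketched below. It is not a definitive implementation: probabilities are estimated jointly from the training set (the dependent case), vectors that never occur in the training set are discarded, "eliminated" is taken to mean the vectors removed for negative necessity, and the training data, class labels and function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def measures(training, vector, label):
    """(M_S, M_N) for a vector of (attribute index, value) pairs and one class,
    using joint estimates; None if the vector never occurs in the training set."""
    n = len(training)
    matches = [cls for obj, cls in training if all(obj[k] == v for k, v in vector)]
    if not matches:
        return None
    n_class = sum(1 for _, cls in training if cls == label)
    p_x, p_class = Fraction(len(matches), n), Fraction(n_class, n)
    diff = Fraction(matches.count(label), n_class) - p_x
    if diff > 0:
        return diff / (p_x / p_class - p_x), diff / (1 - p_x)  # Equations 6 and 7
    return diff / p_x, diff / p_x                              # Equation 8

def search_rules(training, label, threshold):
    """Return [(vector, M_S, M_N), ...] for vectors whose M_S exceeds the threshold."""
    n_attr = len(training[0][0])
    items = sorted({(k, obj[k]) for obj, _ in training for k in range(n_attr)})
    candidates = [frozenset([item]) for item in items]   # vectors of size 1
    rules, eliminated = [], []
    while candidates:
        size = len(candidates[0])
        survivors = []
        for vec in candidates:
            m = measures(training, vec, label)
            if m is None or m[1] < 0:
                eliminated.append(vec)                    # negative necessity
            elif m[0] > threshold:
                rules.append((sorted(vec), m[0], m[1]))   # accepted as a rule
            else:
                survivors.append(vec)
        # Form longer vectors from the attribute values still in play, one value
        # per attribute, skipping any that contain an eliminated vector.
        pool = sorted(set().union(*survivors))
        candidates = [
            frozenset(combo)
            for combo in combinations(pool, size + 1)
            if len({k for k, _ in combo}) == size + 1
            and not any(bad <= frozenset(combo) for bad in eliminated)
        ]
    for rule in rules:
        if rule[1] == 1 and rule[2] == 1:
            return [rule]   # necessary and sufficient: other rules are redundant
    return rules

# Hypothetical training set: "good" occurs only when both a1 and b1 are present.
training = (
    [(("a1", "b1"), "good")] * 4 + [(("a1", "b2"), "bad")] * 2
    + [(("a2", "b1"), "bad")] * 2 + [(("a2", "b2"), "bad")] * 2
)
rules_good = search_rules(training, "good", Fraction(9, 10))
rules_bad = search_rules(training, "bad", Fraction(9, 10))
print(rules_good, rules_bad)
```

In this toy case neither attribute value is individually sufficient for "good", but each has a necessity measure of one, and amalgamating them yields a vector that is both necessary and sufficient; for "bad", two single-value rules are found at the first pass.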


7.  Special  case  of  binary  classification 

In many applications the number of classes is just two, $S = 2$: for example, when
classifying an object as being Good or Bad, Acceptable or Unacceptable, or, as in Mosteller
& Wallace (1984), when investigating disputed authorship between two writers. In these
cases, the AIM method has some useful properties, which can simplify analysis.

If we label the two classes $\varpi_1$ and $\varpi_2$, then the following is true.

$P(\varpi_1) = 1 - P(\varpi_2)$    (15)

and

$P(\varpi_1 \mid x_{ki}) = 1 - P(\varpi_2 \mid x_{ki})$    (16)

It can therefore be shown that

$M_S(x_{ki}, \varpi_1) = -M_S(x_{ki}, \varpi_2)$    (17)

Equations 8 and 17 show that $M_S(x_{ki}, \varpi_1)$, $M_S(x_{ki}, \varpi_2)$ and one of $M_N(x_{ki}, \varpi_1)$ or
$M_N(x_{ki}, \varpi_2)$ can be expressed in terms of a single sufficiency measure, $M_S(x_{ki}, \varpi_j)$.

To reduce the dimension of the problem, one of the classes is associated with positive AIM
measures, while the alternative class corresponds to negative sufficiency and necessity
measures. For example, the set of "Good" objects might be defined as the positive set, so
that a positive sufficiency measure indicates that an object is more likely to be in the
"Good" class, while a negative measure indicates that the object is more likely to be in the
"Bad" class. The sufficiency measure is now expressed as $M_S(x_{ki})$. A similar measure for
necessity, $M_N(x_{ki})$, may be calculated, assuming adherence to $\varpi_1$ corresponds to positive
sufficiency and necessity measures.

$M_N(x_{ki}) = M_N(x_{ki}, \varpi_1)$ if $P(x_{ki} \mid \varpi_1) > P(x_{ki})$    (18)

$M_N(x_{ki}) = -M_N(x_{ki}, \varpi_2)$ if $P(x_{ki} \mid \varpi_2) > P(x_{ki})$    (19)

Hence it is possible to reduce the four AIM measures for the two classes into two
measures, $M_S(x_{ki})$ and $M_N(x_{ki})$, where the sign of these measures is used to indicate to
which class an object belongs.
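The sign convention of Equations 18 and 19 can be illustrated with a short Python sketch. The function name `signed_necessity` and its argument names are hypothetical, and returning zero when neither condition holds (the attribute value is equally likely in both classes) is an assumption, since that case is not covered by the two equations.

```python
def signed_necessity(m_n_1, m_n_2, p_x_given_1, p_x_given_2, p_x):
    """Combine the per-class necessity measures into one signed measure.

    m_n_1, m_n_2             -- necessity measures for classes 1 and 2
    p_x_given_1, p_x_given_2 -- P(x_ki | class 1) and P(x_ki | class 2)
    p_x                      -- P(x_ki)
    """
    if p_x_given_1 > p_x:
        return m_n_1       # Equation 18: value favours the positive class
    if p_x_given_2 > p_x:
        return -m_n_2      # Equation 19: value favours the negative class
    return 0.0             # assumption: neither class is favoured


# Hypothetical numbers, for illustration only.
favours_first = signed_necessity(0.8, 0.3, 0.6, 0.2, 0.4)
favours_second = signed_necessity(0.8, 0.3, 0.2, 0.6, 0.4)
```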

Reducing the number of sufficiency and necessity measures from four to two eliminates
the need for a summation scheme when calculating $M_V(x_{ki})$. Simple investigation of
$M_S(x_{ki})$ and $M_N(x_{ki})$ indicates the relative importance of each attribute value.

The expression for the usefulness of an attribute as a whole is also simplified in the binary
case. In cases where the necessity measure is not appropriate to the overall measure of
usefulness, as in Equation 9b, Equations 15 and 17 can be used to reduce $M_A(x_k)$ to the
following simple summation over the values $i$ of attribute $k$.

$M_A(x_k) = \sum_{i} M_S(x_{ki})^2 \, P(x_{ki})$    (20)
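As a brief illustration of Equation 20, the following Python sketch computes the attribute measure from the signed sufficiency measures and the probabilities of each attribute value. The function name `attribute_usefulness` and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def attribute_usefulness(m_s, p):
    """Equation 20: sum over attribute values of M_S(x_ki)^2 * P(x_ki).

    m_s -- dict mapping each attribute value to its signed sufficiency measure
    p   -- dict mapping each attribute value to its probability P(x_ki)
    """
    return sum(m_s[value] ** 2 * p[value] for value in m_s)


# Hypothetical attribute with three values.
m_s = {0: 1.0, 1: -0.5, 2: 0.0}
p = {0: 0.25, 1: 0.5, 2: 0.25}
m_a = attribute_usefulness(m_s, p)  # 1.0*0.25 + 0.25*0.5 + 0.0*0.25 = 0.375
```

Because the measures are squared, values that point strongly to either class contribute equally, and values with a sufficiency measure of zero contribute nothing.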


8.  Comparison  with  existing  techniques 

The  publicly  available  "small"  soybean  dataset  (Murphy  &  Aha,  1984)  is  used  to  compare 
the  new  AIM  method  presented  here  with  existing  classification  and  dimensionality 
reduction  techniques.  First  used  by  Michalski  (1980)  and  Michalski  &  Stepp  (1983),  the 
dataset  contains  35  categorical  attributes  for  47  cases  or  instances.  These  instances  are 
classified  into  four  classes,  (D1,D2,D3,D4),  corresponding  to  different  diseases  in  soybean 
plants.  Each  attribute  has  between  one  and  seven  values,  and  there  are  no  missing  values. 

A  detailed  performance-based  comparison  between  the  AIM  method  and  a  large  number 
of  competing  algorithms  has  not  been  conducted  here.  The  main  advantage  of  the  AIM 
method  is  its  ability  to  combine  classification,  dimensionality  reduction  and  decision  rule 
extraction into a single method. Therefore, the purpose of this section is to demonstrate
that  this  can  be  done  with  acceptable  results. 

Since the AIM method is most similar to a Naive Bayes analysis, it is against this method
that classification is compared. The "leave one out" cross validation error estimate (James,
1985, pp. 77-78; Hand et al., 2001, p. 360) is used, due to the small sample size.
Classification was carried out using a vector of the full 35 attributes, which were assumed
to be independent. Using this method, both the Naive Bayes and AIM methods
misclassified 5 of the 47 cases, an error rate of 11%.
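The "leave one out" estimate can be sketched in Python as follows. This is a minimal illustration, not the code used to obtain the results above: the functions `nb_predict` and `leave_one_out_errors` are hypothetical, the Naive Bayes classifier uses Laplace smoothing (an assumption, since the smoothing used here is not stated), and the data are a toy example rather than the soybean set.

```python
from collections import Counter
import math


def nb_predict(train, x, alpha=1.0):
    """Categorical Naive Bayes prediction; alpha is an assumed Laplace smoothing."""
    class_counts = Counter(label for _, label in train)
    n_attrs = len(x)
    # Distinct values per attribute, used in the smoothing denominator.
    n_values = [len({row[k] for row, _ in train} | {x[k]}) for k in range(n_attrs)]
    best, best_score = None, -math.inf
    for c, n_c in sorted(class_counts.items()):
        score = math.log(n_c / len(train))  # log prior
        for k in range(n_attrs):
            matches = sum(1 for row, label in train if label == c and row[k] == x[k])
            score += math.log((matches + alpha) / (n_c + alpha * n_values[k]))
        if score > best_score:
            best, best_score = c, score
    return best


def leave_one_out_errors(data, predict):
    """Count cases misclassified when each is held out in turn."""
    errors = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every case except the i-th
        if predict(train, x) != label:
            errors += 1
    return errors


# Hypothetical toy data: the first attribute separates the classes.
toy = [
    ((0, 0), "A"), ((0, 1), "A"), ((0, 0), "A"),
    ((1, 1), "B"), ((1, 0), "B"), ((1, 1), "B"),
]
toy_errors = leave_one_out_errors(toy, nb_predict)
```

The error rate is the error count divided by the number of cases; for the result reported above this is 5/47, or about 11%.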

The AIM classification procedure produced measures of sufficiency, $M_S$, for each
attribute value in each class. These are now summed, as in Equation 12, to provide a
measure for the usefulness of each attribute, $M_A$. This is compared with the standard chi
squared ($\chi^2$) test, given in Equation 11, and the information gain method used in decision
tree techniques (Quinlan, 1986). In all cases a high score indicates the attribute is useful in
classification.  The  results  are  presented  in  Table  3.  The  attribute  number  given  in  the 
leftmost  column  corresponds  to  the  order  in  which  the  data  is  provided  in  the  literature 
(Murphy  &  Aha,  1984).  The  ranking  of  attributes  based  on  their  AIM,  chi  squared  and 
information  gain  scores  is  also  provided  in  the  table.  The  results  show  that  the  three 
methods  are  in  close  agreement  in  their  ranking  of  the  bottom  21  attributes  (that  is,  rank  15 
and  beyond).  There  is  also  parity  in  the  identification  of  attributes  21  and  22  as  being 
clearly  the  most  useful.  The  correlation  between  the  scores  for  the  remaining  attributes  is 
also  strong  for  the  AIM  and  information  gain  methods. 
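For reference, the two comparison scores can be computed for a single categorical attribute as in the Python sketch below. It assumes the standard Pearson chi squared statistic on the attribute-by-class contingency table and the usual entropy-based information gain; the function names and the toy data are hypothetical, and the sketch does not reproduce the values in Table 3.

```python
from collections import Counter
import math


def chi_squared(values, labels):
    """Pearson chi squared statistic for an attribute against the class."""
    n = len(values)
    v_counts, c_counts = Counter(values), Counter(labels)
    joint = Counter(zip(values, labels))
    stat = 0.0
    for v, n_v in v_counts.items():
        for c, n_c in c_counts.items():
            expected = n_v * n_c / n  # count expected under independence
            stat += (joint[(v, c)] - expected) ** 2 / expected
    return stat


def entropy(labels):
    """Shannon entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on the attribute."""
    n = len(values)
    remainder = 0.0
    for v, n_v in Counter(values).items():
        subset = [c for x, c in zip(values, labels) if x == v]
        remainder += (n_v / n) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy attributes: one separates the classes, one does not.
labels = ["A", "A", "B", "B"]
useful = [0, 0, 1, 1]
useless = [0, 1, 0, 1]
scores = {
    "useful": (chi_squared(useful, labels), information_gain(useful, labels)),
    "useless": (chi_squared(useless, labels), information_gain(useless, labels)),
}
```

Both scores are high for the attribute that separates the classes and zero for the one that is independent of the class, which is the behaviour the ranking in Table 3 relies on.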

Having  ranked  attributes  in  order  of  importance  to  classification,  the  data  is  re-classified 
using  a  reduced  feature  set.  The  number  of  attributes  selected  from  Table  3  for  this  task 
may  be  found  using  cross  validation,  or  similar  methods.  This  process  is  not  specific  to  the 
AIM method, and is therefore omitted here. Table 3 shows that, in this case,
attributes 21 and 22 are clearly superior, so they are the features selected
for the re-classification of the data. The "leave one out" error rate is again used. The results
of  re-classification  are  now  much  improved,  with  100%  correct  classification,  using  both 
the  AIM  method  and  Naive  Bayes.  This  demonstrates  that  the  dimensionality  reduction 
technique  available  through  the  AIM  algorithm  is  viable,  and  is  applicable  to  classification 
techniques other than the AIM method.

The AIM decision rule search algorithm is now applied to attributes 21 and 22 from the
soybean data. This is a simple process, as most values for these attributes map directly to
classes (hence their dominance in the attribute ranking). With a vector length of 1,
$M_S(x_{ki}, \varpi_j) = 1$ for a number of attribute values, implying direct mapping, given below.


(Attribute 21 = 3) -> D1    (21)

(Attribute 21 = 0) -> D2    (22)

(Attribute 21 = 2) -> D4    (23)

(Attribute 22 = 0) -> D1    (24)

(Attribute 22 = 3) -> D2    (25)

(Attribute 22 = 2) -> D4    (26)

The necessity measures for the rules in Equations 22 and 25 are both 1, and they apply to
the same class. Therefore, one of them may be removed, as it is redundant to classification
into D2. Also, the necessity measure for the rule in Equation 21 is 1, exceeding that of
Equation 24, so that the latter may be removed. Equally, the necessity measure for
Equation 26 is unity, greater than the measure for Equation 23, allowing the latter to be
eliminated.

Class D3 is not represented in the rules above. Therefore, the vector length is increased to
2, and all attribute values satisfying $M_N(x_{ki}, D3) > 0$ are used to form new trial vectors.
This applies to (Attribute 21 = 1) and (Attribute 22 = 1) only. Since
$M_S$((Attribute 21 = 1, Attribute 22 = 1), D3) = 1, the following decision rule is obtained.

(Attribute 21 = 1, Attribute 22 = 1) -> D3    (27)

Equations  21, 22, 26  and  27  provide  a  set  of  decision  rules  that  are  necessary  and  sufficient 
to  accurately  classify  any  new  case,  based  on  the  training  data. 
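The retained rules can be applied to a new case with a few lines of Python, sketched below. The representation of a case as a dictionary keyed by attribute number, the function name `classify`, and the choice to return `None` when no rule matches are all assumptions made for illustration.

```python
# Decision rules retained from Equations 21, 22, 26 and 27.
RULES = [
    ({21: 3}, "D1"),         # Equation 21
    ({21: 0}, "D2"),         # Equation 22
    ({22: 2}, "D4"),         # Equation 26
    ({21: 1, 22: 1}, "D3"),  # Equation 27
]


def classify(case):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in RULES:
        if all(case.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # assumption: no rule applies to this case


# Hypothetical cases described only by attributes 21 and 22.
predicted = [classify({21: 3, 22: 1}), classify({21: 1, 22: 1})]
```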


9.  Conclusion 


A  new  method  has  been  presented  for  classification  of  objects  with  categorical  data.  The 
Attribute  Importance  Measure  (AIM)  method  uses  logical  conditions  of  sufficiency, 
necessity  and  irrelevance  in  combination  with  probabilistic  techniques  to  provide  features 
which  are  not  available  in  traditional  Naive  Bayes  techniques.  One  such  feature  is  the 
integration  of  dimensionality  reduction  into  the  classification  algorithm. 

The  classification  functionality  of  the  algorithm  has  been  shown  to  be  comparable  with 
Naive  Bayes,  while  the  dimensionality  reduction,  or  feature  selection,  gives  similar  results 
to  a  chi  squared  test.  Indeed,  the  behaviour  of  the  AIM  attribute  statistic  for  independent 
data has equivalent properties to chi squared, such as independence from the probability
density  functions  and  a  variation  with  the  degrees  of  freedom. 

It  has  been  shown  that  the  AIM  method  provides  a  mechanism  by  which  searching  for 
vectors  of  attribute  values  which  are  useful  in  classification  (decision  rules)  may  be 
performed  efficiently. 

Finally,  the  various  aspects  of  the  AIM  method  have  been  applied  to  an  open  source  data 
set  and  compared  with  an  existing  classification  and  feature  selection  technique. 
Classification  and  dimensionality  reduction  performance  was  equivalent  between  the  AIM 
method  and  existing  techniques.  This  demonstrates  the  power  of  the  AIM  method  to 
combine  a  number  of  tasks  into  one  algorithm  for  real  data. 